Automatic identification of relevant chemical compounds from patents
In commercial research and development projects, public disclosure of new chemical
compounds often takes place in patents. Only a small proportion of these compounds
are published in journals, usually a few years after the patent. Patent authorities make
the patents available but do not provide systematic, continuous chemical annotations.
Content databases such as Elsevier’s Reaxys provide such services mostly based on
manual excerptions, which are time-consuming and costly. Automatic text-mining
approaches help overcome some of the limitations of the manual process. Different
text-mining approaches exist to extract chemical entities from patents. The majority
of them have been developed using sub-sections of patent documents and focus on
mentions of compounds. Less attention has been given to relevancy of a compound in a
patent. Relevancy of a compound to a patent is based on the patent’s context. A relevant
compound plays a major role within a patent. Identification of relevant compounds
reduces the size of the extracted data and improves the usefulness of patent resources
(e.g. supports identifying the main compounds). Annotators of databases like Reaxys
only annotate relevant compounds. In this study, we design an automated system
that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant
compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition
system was based on proprietary tools. The performance (F-score) of the system on
compound recognition was 84% on the development set and 86% on the test set. The
relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and
classify their relevance with high performance. This enables the extension of the Reaxys
database by means of automation.
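The abstract above reports performance as an F-score without giving the underlying counts. As a minimal sketch, assuming the balanced F-score (F1, the harmonic mean of precision and recall) is meant, the metric can be computed as follows. The function name `f_score` and the counts in the example are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1) from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the study).
example = f_score(tp=80, fp=10, fn=20)
print(f"F-score: {example:.3f}")
```

With these made-up counts, precision is 80/90 and recall is 80/100, so the harmonic mean sits between the two and is pulled toward the lower value.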
The CHEMDNER corpus of chemicals and drugs and its annotation principles
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one
of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large,
manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison
of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000
PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry
literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER
corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was
manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family,
formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were
measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the
CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also
mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention
recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions
from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been
generated as well. We propose a standard for required minimum information about entity annotations for the
construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation
guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus
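The abstract above reports a percentage agreement between annotators but does not spell out how it was calculated. The sketch below shows one simple way such a figure could be obtained, assuming each annotator's output is a set of (start, end) character spans and that agreement is the share of all annotated spans marked identically by both. This definition, the function name `percentage_agreement` and the example spans are assumptions for illustration, not the CHEMDNER procedure.

```python
Span = tuple[int, int]  # (start offset, end offset) of a chemical mention


def percentage_agreement(annotator_a: set[Span], annotator_b: set[Span]) -> float:
    """Share of all annotated spans that both annotators marked identically, in percent."""
    all_spans = annotator_a | annotator_b
    if not all_spans:
        return 100.0  # nothing annotated by either side counts as full agreement
    return 100.0 * len(annotator_a & annotator_b) / len(all_spans)


# Hypothetical spans for illustration only.
spans_a = {(0, 7), (12, 20), (30, 34)}
spans_b = {(0, 7), (12, 20), (40, 45)}
agreement = percentage_agreement(spans_a, spans_b)
print(f"Agreement: {agreement:.0f}%")
```

Exact span matching is a strict criterion; a looser variant could count overlapping spans as agreement, which would generally give a higher figure.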
Limited clinical relevance of imaging techniques in the follow-up of patients with advanced chronic lymphocytic leukemia: results of a meta-analysis
The clinical value of imaging is well established for the follow-up of many lymphoid malignancies but not for chronic lymphocytic leukemia (CLL). A meta-analysis was performed with the dataset of 3 German CLL Study Group phase 3 trials (CLL4, CLL5, and CLL8) that included 1372 patients receiving first-line therapy for CLL. Response as well as progression during follow-up was reassessed according to the National Cancer Institute Working Group 1996 criteria. A total of 481 events were counted as progressive disease during treatment or follow-up. Of these, 372 progressions (77%) were detected by clinical symptoms or blood counts. Computed tomography (CT) scans or ultrasound were relevant in 44 and 29 cases (9% and 6%), respectively. The decision for relapse treatment was determined by CT scan or ultrasound results in only 2 of 176 patients (1%). CT scan results had an impact on the prognosis of patients in complete remission only after the administration of conventional chemotherapy but not after chemoimmunotherapy. In conclusion, physical examination and blood count remain the methods of choice for staging and clinical follow-up of patients with CLL as recommended by the International Workshop on Chronic Lymphocytic Leukemia 2008 guidelines. These trials are registered at http://www.isrctn.org as ISRCTN 75653261 and ISRCTN 36294212 and at http://www.clinicaltrials.gov as NCT00281918
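The percentages quoted in the abstract above follow directly from the reported counts. The short check below recomputes them, assuming each figure was rounded to the nearest whole percent. The helper name `percent` and the dictionary labels are illustrative; the counts are the ones stated in the abstract.

```python
# Counts reported in the abstract above.
TOTAL_PROGRESSIONS = 481
TOTAL_RELAPSE_DECISIONS = 176


def percent(part: int, total: int) -> int:
    """Return part/total as a percentage rounded to the nearest integer."""
    return round(100 * part / total)


recomputed = {
    "detected by clinical symptoms or blood counts": percent(372, TOTAL_PROGRESSIONS),
    "CT scan relevant": percent(44, TOTAL_PROGRESSIONS),
    "ultrasound relevant": percent(29, TOTAL_PROGRESSIONS),
    "relapse treatment decided by imaging": percent(2, TOTAL_RELAPSE_DECISIONS),
}

for label, value in recomputed.items():
    print(f"{label}: {value}%")
```

The recomputed values are 77%, 9%, 6% and 1%, matching the figures given in the text.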